Reflections and a Proposal for a Query and Reporting Language for Richly Annotated Multiparallel Corpora
نویسنده
چکیده
Large and open multiparallel corpora are a valuable resource for contrastive corpus linguists if the data is annotated and stored in a way that allows precise and flexible ad hoc searches. A linguistic query language should also support computational linguists in automated multilingual data mining. We review a broad range of approaches for linguistic query and reporting languages according to usability criteria such as expressibility, expressiveness, and efficiency. We propose an architecture that tries to strike the right balance to suit practical purposes.
منابع مشابه
Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora
The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, if we can exploit the data properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as part-of-speech tagging, lemmatisation, chunking, and dependency parsin...
متن کاملImplementing Linguistic Query Languages Using LoToS
A linguistic database is a collection of texts where sentences and words are annotated with linguistic information, such as part of speech, morphology, and syntactic sentence structure. While early linguistic databases focused on word annotations, and later also on parse-trees of sentences (so-called treebanks), the recent years have seen a growing interest in richly annotated corpora of histor...
متن کاملANNIS3: A new architecture for generic corpus query and visualization
This paper is concerned with the data structures, properties of query languages and visualization facilities required for the generic representation of richly annotated, heterogeneous linguistic corpora. We propose that above and beyond a general graph based data-model, which is becoming increasingly popular in many complex annotation formats, a well-defined concept of multiple, potentially con...
متن کاملStoring and Querying Historical Texts in a Relational Database
This paper describes an approach for storing and querying a large corpus of linguistically annotated historical texts in a relational database management system. Texts in such a corpus have a complex structure consisting of multiple text layers that are richly annotated and aligned to each other. Modeling and managing such corpora poses various challenges not present in simpler text collections...
متن کاملFinite Structure Query: A Tool for Querying Syntactically Annotated Corpora
Finite structure query (fsq for short) is a tool for querying syntactically annotated corpora. fsq employs a query language of high expressive power, namely full first order logic. It can be used to query arbitrary finite structures, not just trees.
متن کامل